Task 1: Dataset overview
Goal: prepare a report (.html) that briefly describes the Titanic dataset.
Hint: you can use last lesson's output `.ipynb` file as a source (or download one from the GitHub repository).
- subtask 1 - Convert the document using `jupyter nbconvert` (see documentation for reference).
  - Hint: rerender the document after each step to see the progress.
  - Hint: keep your original document so that you can compare it with later versions. (You can do it using the `--output my_old_nb` option.)
- subtask 2 - Let the document be recalculated when it is rendered. Use the `--execute` option.
- subtask 3 - Get rid of the code chunks in the converted document. Use the `--no-input` option.
- subtask 4 - Add some structure to the document.
  - Use Markdown tags for headers (e.g. Goal, Data, Columns, ..., Summary). You may add some text as well (e.g. try to describe how to read a graph you've included).
  - Be sure to insert a (top level) header `# Columns` and add several subsections (`## Age`, `## Sex`, ...). You will need them to solve the following tasks. Note: it would be great to include some plot/table here (e.g. to describe the outcome ~ column relation), but any lorem ipsum is enough to test the setup.
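The options from subtasks 1 to 3 can be combined in a single call. The sketch below only assembles the command in Python and prints it; the helper name `build_nbconvert_cmd` and the notebook name `titanic_overview.ipynb` are placeholders invented for illustration, not part of the course material.

```python
import shlex
import subprocess


def build_nbconvert_cmd(notebook, execute=False, no_input=False, output=None):
    """Assemble a `jupyter nbconvert` call that renders a notebook to HTML."""
    cmd = ["jupyter", "nbconvert", "--to", "html"]
    if execute:
        cmd.append("--execute")      # re-run all cells before rendering
    if no_input:
        cmd.append("--no-input")     # hide the code cells in the output
    if output is not None:
        cmd += ["--output", output]  # base name of the rendered file
    cmd.append(notebook)
    return cmd


# "titanic_overview.ipynb" is a placeholder name, not a file from the course.
cmd = build_nbconvert_cmd("titanic_overview.ipynb", execute=True, no_input=True)
print(shlex.join(cmd))

# Uncomment to actually run the conversion (requires Jupyter to be installed):
# subprocess.run(cmd, check=True)
```

Passing `output="my_old_nb"` to the helper adds `--output my_old_nb`, which is one way to keep an earlier rendering next to the new one.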
Task 2: Dataset overview using pretty-jupyter
Goal: use the pretty-jupyter Python package to produce prettier reports. (See documentation for reference.)
- subtask 1 - Add `--template pj` to the `nbconvert` command. (You need to have `pretty-jupyter` installed!)
  - Hint: Now you can get rid of the `--no-input` option; `pretty-jupyter` deals with that for you.
  - Hint: Among other things, it gives you a table of contents and a prettier font.
- subtask 2 - Add report metadata
author: Me
title: Document name
date: 2022-01-01
- subtask 3 - Use `tabsets` to convert the column description sections into panels.
  - Add the `.tabset` token just after the `# Columns` header (see docs for reference).
- subtask 4 - Use dynamically rendered text, e.g. try to describe the dataset's dimensions but do not hardcode them.
  - Add the magic `%load_ext pretty_jupyter` in some cell (it could be the same cell where you load packages).
  - Store the desired values in some variables, e.g. `n_rows = data.shape[0]`.
  - Insert a code cell and add `%%jmd` (or `%%jinja markdown`) on the very first line of the cell.
  - Write your text into the cell. Add `{{ n_rows }}` to insert the variable's value.
  - Optionally, you may try to add a data frame (hint: you need to convert it to HTML first - use `{{ df.to_html() }}`) or a plot (use the `matplotlib_fig_to_html()` function from `pretty_jupyter.helpers` to do so: `{{ matplotlib_fig_to_html(plt) }}`). See docs for reference.
Task 3: Dataset overview using pandas-profiling
Goal: profile the dataset using the pandas-profiling Python package.
- subtask 1 - Use the `pandas-profiling` CMD tool to profile the `titanic_train.csv` dataset.
  - Run `pandas_profiling --title "Můj Titanic Train" data/titanic_train.csv`. (You can find the output file in the same directory as the input data.)
- subtask 2 - Use the `pandas-profiling` package to profile the `titanic_train.csv` dataset.
import pandas as pd
from pandas_profiling import ProfileReport
titanic_train = pd.read_csv("../data/titanic2/titanic_train.csv")
titanic_profile = ProfileReport(titanic_train, title="Titanic Profiling Report")
titanic_profile.to_file("my_titanic_report.html")
- subtask 3 - Get rid of the `correlation` section (as it may be time-consuming to compute), or keep just one correlation.
ProfileReport(titanic_train, title="Titanic Profiling Report", correlations={
"auto": {"calculate": False},
"pearson": {"calculate": False},
"spearman": {"calculate": False},
"kendall": {"calculate": False},
"phi_k": {"calculate": False},
"cramers": {"calculate": False},
})
- subtask 4 - Customize the report a bit. E.g., you can try:
  - change theme - `html.style.theme` (either `flatly` or `united`)
  - base color - `html.style.primary_color` (enter a hex code)
  - ... (see docs for more details).
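A minimal sketch of how these settings might be passed, assuming that dotted names such as `html.style.theme` map to nested dictionaries in the `ProfileReport` keyword arguments, in the same way as the `correlations` argument in subtask 3. The helper names `build_report_options` and `make_report` and the example color are invented for illustration.

```python
def build_report_options(theme="flatly", primary_color="#337ab7"):
    """Return keyword arguments that restyle the HTML report."""
    if theme not in ("flatly", "united"):
        raise ValueError("theme must be 'flatly' or 'united'")
    # Assumption: html.style.<name> corresponds to html={"style": {<name>: ...}}.
    return {"html": {"style": {"theme": theme, "primary_color": primary_color}}}


def make_report(csv_path, output_path, **options):
    """Profile a CSV file and write the styled report to output_path."""
    # Imported lazily so the helper above works without the packages installed.
    import pandas as pd
    from pandas_profiling import ProfileReport

    data = pd.read_csv(csv_path)
    profile = ProfileReport(data, title="Titanic Profiling Report", **options)
    profile.to_file(output_path)


options = build_report_options(theme="united", primary_color="#1f77b4")
print(options)

# Uncomment to build the report (requires pandas and pandas-profiling):
# make_report("../data/titanic2/titanic_train.csv", "my_titanic_report.html", **options)
```

Keeping the options in one dictionary makes it easy to merge in further settings, for example the `correlations` dictionary from subtask 3.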
Task 4: Dataset overview using Quarto
Goal: profile the dataset using Quarto. (Of course, you need to install it first; see the download section for reference.)
- subtask 1 - Convert the `.ipynb` into `.html` using Quarto.
  - Use the CMD utility `quarto render <my_document.ipynb> --to html`.
- subtask 2 - (Optional; best done in VSCode with the Quarto extension)
  - Create a brand-new `.qmd` file.
  - You can mock up the content using e.g. the hello-world example. (You may need the `plotly` package as well.)
  - Try to convert the document (use the `ctrl + shift + k` shortcut in VSCode).
  - Try to tweak the metadata a bit (it's similar to the `pretty-jupyter` raw cell). Set e.g.
    - theme
    - toc (incl. toc position, toc title, etc.)
    - abstract
    - ...
    - Hint: docs for reference.
    - Hint: VSCode supports IntelliSense to help you with that.
  - Try to add cell metadata (see docs for reference).
    - Add `fig-label` and use it in text.
    - Add `eval: false` to disable evaluation.
    - Add `code-line-numbers` to add (surprise, surprise) line numbers to the code chunk.
    - ...
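As a rough sketch, cell metadata in Quarto is written as `#|` comment lines at the very top of a code cell. The block below shows the body of one such cell; in a `.qmd` file it would sit inside a `{python}` code fence. The label `class-counts` and the toy values are invented for illustration.

```python
#| label: class-counts
#| code-line-numbers: true
#| eval: true

# Quarto reads the `#|` lines above as cell options; set `eval: false`
# to display this code without running it.
from collections import Counter

# Toy values, not taken from the Titanic data.
passenger_classes = [3, 1, 3, 2, 3, 1]
class_counts = Counter(passenger_classes)
print(sorted(class_counts.items()))
```

For a cell that produces a figure, a label starting with `fig-` (for example `fig-class-counts`) lets the text refer to it as `@fig-class-counts`.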
Task 5: Exploratory analysis
Goal: prepare a brief exploratory analysis of the Titanic (train) dataset. Cover the following questions:
- What is the outcome distribution (`survived`)?
- What is the relation between the outcome and each individual column? (Visualise it!)
- Are the values missing at random, or does the missingness (somehow) correspond to the outcome?
- ... (add your own questions.)
Do not forget to structure the report, describe the input data and summarise your insights. Follow the tips covered in the lecture (e.g. use of colors, text descriptions of charts, chart annotations, ...).
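One possible starting point for the three questions, sketched with pandas on a tiny made-up table. The column names `sex` and `age` are assumptions about the dataset, and the values are invented purely to make the snippet runnable; replace `toy` with the real training data.

```python
import pandas as pd

# Toy stand-in for the training data; values are made up for illustration.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "sex": ["male", "female", "female", "male", "male", "female", "male", "male"],
    "age": [22.0, 38.0, None, 35.0, None, 27.0, None, 4.0],
})

# 1. Outcome distribution: share of each outcome value.
outcome_share = toy["survived"].value_counts(normalize=True).sort_index()

# 2. Outcome vs. a single column: survival rate per category.
survival_by_sex = toy.groupby("sex")["survived"].mean()

# 3. Missingness vs. outcome: share of missing ages per outcome value.
missing_age_by_outcome = toy["age"].isna().groupby(toy["survived"]).mean()

print(outcome_share)
print(survival_by_sex)
print(missing_age_by_outcome)
```

Each of the three results is a small Series, so it can be turned into a bar chart with `.plot.bar()` once matplotlib is available, which covers the "visualise it" part of the task.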